Annotation Tools for Large-Scale Corpus Development: Using AGTK at the Linguistic Data Consortium
Abstract
Large-scale corpus development demands substantial infrastructure. As part of this infrastructure, the Linguistic Data Consortium (LDC) has adopted the Annotation Graph Toolkit (AGTK) as a primary resource for annotation tool development. This paper reports on LDC’s experiences using AGTK to develop and implement highly customized annotation tools for a variety of large-scale corpus creation efforts. We describe two primary tools that are currently in active use at LDC, one speech-based and one text-based, as well as other new AGTK-based annotation tools. We also describe the use of AGTK to develop tools for comparing and adjudicating divergent annotations in order to produce gold standard evaluation data and to measure inter-annotator consistency. Finally, we discuss various issues in creating AGTK-based tools across a wide range of annotation tasks and divergent research areas.
Similar resources
A New Phase in Annotation Tool Development at the Linguistic Data Consortium: The Evolution of the Annotation Graph Toolkit
The Linguistic Data Consortium (LDC) has created various annotated linguistic data for a variety of common task evaluation programs and projects to create shared linguistic resources. The majority of these annotated linguistic data were created with highly customized annotation tools developed at LDC. The Annotation Graph Toolkit (AGTK) has been used as a primary infrastructure for annotation t...
Annotation Tool Development for Large-Scale Corpus Creation Projects at the Linguistic Data Consortium
The Linguistic Data Consortium (LDC) creates a variety of linguistic resources – data, annotations, tools, standards and best practices – for many sponsored projects. The programming staff at LDC has created the tools and technical infrastructures to support the data creation efforts for these projects, creating tools and technical infrastructures for all aspects of data creation projects: data...
Models and Tools for Collaborative Annotation
The Annotation Graph Toolkit (AGTK) is a collection of software which facilitates development of linguistic annotation tools. AGTK provides a database interface which allows applications to use a database server for persistent storage. This paper discusses various modes of collaborative annotation and how they can be supported with tools built using AGTK and its database interface. We describe ...
Language Resource Creation and Distribution at the Linguistic Data Consortium: A Progress Report
Changes in the supply of and demand for language resources continue to affect the role of large data centers such as the Linguistic Data Consortium (LDC) and European Language Resource Center (ELRA) within the research communities they serve. The past few years have seen increased demand for: intensively multi-modal resources, larger data sets in high-density languages and new data in low dens...
Multilevel corpus analysis: generating and querying an AGset of spoken Italian (SpIt-MDb)
In this paper we present an application of AGTK to a corpus of spoken Italian annotated at many different linguistic levels. The work consists of two parts: a) the presentation of AG-SpIt, a toolkit devoted to corpus data management that we developed according to AGTK proposals; b) the presentation of the corpus structure together with some examples and results of cross-level linguistic analyses o...